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Abstract 

We propose a variational method to solve all three estimation problems for nonlinear 
stochastic dynamical systems: prediction, filtering, and smoothing. Our new approach is 
based upon a proper choice of cost function, termed the effective action. We show that this 
functional of time- histories is the unique statistically well-founded cost function to determine 
most probable histories within empirical ensembles. The ensemble dispersion about the 
sample mean history can also be obtained from the Hessian of the cost function. We show 
that the effective action can be calculated by a variational prescription, which generalizes the 
"sweep method" used in optimal linear estimation. An iterative numerical scheme results 
which converges globally to the variational estimator. This scheme involves integrating 
forward in time a "perturbed" Fokker-Planck equation, very closely related to the Kushncr- 
Stratonovich equation for optimal filtering, and an adjoint equation backward in time, 
similarly related to the Pardoux-Kushner equation for optimal smoothing. The variational 
estimator enjoys a somewhat weaker property, which we call "mean optimality" . However, 
the variational scheme has the principal advantage — crucial for practical applications — that 
it admits a wide variety of finite-dimensional moment-closure approximations. The moment 
approximations are derived reductively from the Euler-Lagrange variational formulation and 
preserve the good structural properties of the optimal estimator. 

'Permanent address: Department of Mathematics, University of Arizona, Tucson, AZ 85721 



Introduction 



The three classical problems of stochastic estimation are prediction, filtering, and smoothing 
of time series; e.g. see [0. These correspond to estimating the future, present, and past states, 
respectively, based upon current available information. In more detail, the nonlinear estima- 
tion problem may be described as follows: assume as known some nonlinear (Ito) stochastic 
differential equation for a time-series X(t): 

dX = f (X, t)dt + (2D) 1/2 (X, t)dW(t). (0.1) 

Here f is a (drift) dynamical vector, D is a nonnegative diffusion matrix, and W(i) is a vector 
Wiener process. Suppose also that some imperfect observations r(t) are taken of a function 
Z(K(t),t) of the basic process, including some measurement errors p(t) with covariance R(t): 

r(t) = Z(X(t),t)+p(t). (0.2) 

It will generally be assumed that the distribution of the measurement errors is known as well. 
For example, the errors may be assumed to be proportional to a white noise: p(t) = H}/ 2 (t)rj(t). 
Then the problem is, given the data 1Z(tf) = {r(i) : t < tf} up to final measurement time tf, 
to obtain the best estimate of X(t) at times t > tf, t = tf and t < tf. 

The optimal filtering problem in the above general setting has been exactly solved by 
Stratonovich ||] and Kushner |3], ||| within a Bayesian formulation. Those authors have shown 
that the conditional probability density V(x.,t\lZ(t)), given the current data lZ(t), obeys a 
stochastic partial differential equations, nowadays called the Kushner- Stratonovich equation. 
Explicitly, denoting V*(x.,t) = V(x.,t\H(t)), the KS equation is of the form 

d t V*(x, t) = L(i)P*(x, t) + h T {t)[Z(x, t) - (Z{t))* t }V*{x, t). (0.3) 

where 

L{t) = - V x -[f (x, €){■)] + V x <g> V x :[D(x, t)(-)) (0.4) 
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is the standard Fokker-Planck linear operator, and 

h(t) = R- 1 (t)[r(t)-(Z(t)U] (0.5) 

is a random forcing term constructed from the particular realization of the observation r(t) 
obtained in a given sample run of the system. Note that (-)*t = E(-\H(t)) denotes conditional 
average, i.e. the average with respect to the distribution V(x,t\lZ(t)) itself. Hence, the KS 
equation is nonlinear. The integration of this equation forward in time, with sequential input of 
the fresh observations r(t) as they become available, solves, in principle, the filtering problem. 
The prediction problem is then solved in theory, by integrating the standard Fokker-Planck 
equation with V(x,tf\TZ(tf)) as initial data to obtain V(x.,t\1Z(tf)) for t > tf. 

The optimal smoothing problem has also been solved, in principle, by Kushner || and 
Pardoux ||. They have shown that V(x.,t\H(tf)) for t <tf can be written as 

V(x,t\K(t f )) = A*(x,t)V*(x,t), (0.6) 

where V*(x.,t) is as above and »4*(x, t) solves the adjoint equation 

$A(x,t) + L*(t)A(x,t) +h T (t)[Z(x,t) - (Z(t))* t }A*(x,t) = 0. (0.7) 

This equation, with the random forcing h(i), must be interpreted as a "backward stochastic 
equation". It is solved subject to the final condition A(x,tf) = 1. 

In certain cases, these estimators reduce exactly to solving a finite number of ODE's. For 
example, in the linear case, where f(x, t) = A(t)x, D(x, t) = D(i), and Z(x,t) = B(t)x, the 
KS optimal filter reduces exactly to the finite-dimensional Kalman-Bucy optimal linear filter 
0. This reduction occurs in the linear case because the conditional PDF is known rigorously 
to be multivariate Gaussian, uniquely specified by its mean and covariance. The conditional 
mean J5[X(t)|72.(i)] coincides with the Kalman-Bucy filter estimate of the current state 
X(t), which is determined by the solution of a stochastic ODE with sequential input of the 
observations. The covariance matrix C(t) = E[X(t)X T (t)\K(t)] -£(i)£ T (i) is obtained as well 
from a linear Ricatti equation integrated forward in time. 
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The Pardoux-Kushner smoother is also finite-dimensional for linear systems. In fact, it 
coincides there with an alternative variational formulation of the linear estimation problem. 
The latter can be motivated most naively from the idea of least-square-error estimation. That 
is, one may introduce a weighted square-error functional for the dynamics, 

T x [x] = \ f tf dt [x - A(i)x] T D _1 (t)[x - A(t)x], (0.8) 

4 hi 

which measures the "cost" for a history x(t) to depart from the solution of the linear, deter- 
ministic dynamics x = A(t)x. The integral is weighted by the "error covariance" D(i) which 
arises from the random noise. A similar cost function may be introduced for the observation 
error of the data: 

r R [p] = \ dt p(t) T n- l {t)p{t). (0.9) 

In that case, the solution to the estimation problem may be obtained by minimizing with respect 
to x the combined cost function 

r X)ii [x,r] := T x [x] + T R {v - Bx] (0.10) 

when the set of observations {r(t) : ti < t < tf} is input into the second term. The minimizer 
x* = x» [r] is then the optimal history, which solves simultaneously all three estimation problems. 
It may be shown that x*(i) = £(t) for t > tf, so that the variational estimator coincides with 
the Kalman-Bucy filter and predictor. Furthermore, it may be shown that, for t < tf, the 
variational estimator is given by the Ansatz 

x*(t) = £(i) + C(t)a(t). (0.11) 

Here, (x(t) is the solution of a linear adjoint equation integrated backward in time with the final 
condition a(tf) = and thus vanishes identically for t >tf. However, it makes a contribution 
for t < tf to the smoother, proportional to C(t). This adjoint algorithm to calculate the mini- 
mizer is called the "sweep method" ||, and it gives the same result x*^) = E [X(t)\lZ(t f)] as 
calculated by the Pardoux-Kushner equation. Hence, it provides a finite-dimensional represen- 
tation of the optimal smoother for linear systems. 
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In general, however, the optimal estimators are infinite-dimensional, i.e. they require the 
solution of (stochastic) PDE's. For many of the spatially-extended, continuum systems of 
greatest interest in geophysics and in engineering, this is, in fact, a functional PDE. Even 
discretization for numerical solution results in a (stochastic) PDE on a phase space of dimension 
literally a billion or more. It has therefore been clear since their original formulation that, for 
such spatially-extended or distributed systems, the exact calculation of the optimal nonlinear 
estimator will be numerically unfeasible. Kushner himself wrote an early paper |]], in which he 
stressed this point and set up a formalism for approximating the optimal filter. As he observed 
there, the problem is formally the same as the "closure problem" in turbulence theory. The 
approximation scheme he proposed was also the same as that traditionally adopted in turbulence 
theory: namely, a moment closure of the full KS equation. Such a scheme results in a set 
of equations with a number of variables comparable to that in the starting equation ( p.l[ ), 
which may still be large but tractable. Constructing finite-dimensional approximate estimators 



continues to be a pressing research problem up to the present day, e.g. see [1C]. Indeed, general 
approximation schemes for the full estimation problem (prediction, filtering and smoothing) 
that are at once computationally practicable and faithful to the optimal solution remain to be 
developed. The fact that the problem of estimation for extended systems is formally equivalent 
to the turbulence problem — a notoriously difficult one — suggests that the solution here, too, 
will be nontrivial. Not only must the formal properties of the optimal estimator be retained by 
any approximation, but also the physical properties of the underlying dynamical system must 
be sufficiently represented. The problem of approximating the optimal estimator is not, in our 
opinion, just one of mathematics but also of physics. 

The aim of this paper is to formulate a new approach to the problem of optimal nonlinear 
estimation, based upon a variational formulation. The crux of the method is to identify an 
action functional, analogous to ( |0.8| ), which is statistically justified to use as a cost function 
for estimation of nonlinear dynamics. This is the quantity which we have called the effective 
action in previous works |ll]-p3||. This functional of state histories is uniquely characterized as 



that which selects the most probable value under arbitrary conditions on the empirical sample 
averages. In fact, the cost function (0.8) appropriate for linear systems has been motivated 



only rather crudely but it has a more fundamental probabilistic justification. It was apparently 
first observed by the chemist Lars Onsager that the dynamical cost function ( p.g| ) is the unique 
functional whose minimum determines the statistically most probable time-history of a linear 
dynamics of form ( p.l[ ), subject to an arbitrary sets of constraints. In the statistical physics 
literature, the functional ( |0.8| ) is known as the Onsager-Machlup action []l4|]. The cost function 
(p.9| ) for the current observation error can be similarly shown to give the most probable error in 
the case of a Gaussian white- noise distribution. The combined cost function flO.10 ) — under the 



assumption that dynamical noise and observation error are independent random functions — 
then indeed gives by minimization the most probable time-history subject to currently available 
information. Thus, the minimizer x*[r] is the unequivocal optimal estimator in the linear case. 
Previous attempts to develop variational methods for optimal nonlinear estimation have not 
paid sufficient attention to the statistical requirements on the cost function. For example, a 
functional has often been employed similar to ([T^) for the linear case, 

r x [x] = i [ tf dt [jc-f(x,t)] T T>-\t)[±-f(yL,t)}, (0.12) 

naively based upon least-square-error philosophy. However, the use of this cost function has no 
statistical justification, except in the weak-noise limit D —* 0. In that case, ( 0.12j ) is known as 



the "nonlinear Onsager-Machlup action" and it is proved to give the leading-order asymptotics 
of probabilities of time histories for small noise fTs! , [l6[| . However, except for the weak-noise 
limit or for linear dynamics, the Onsager-Machlup action has no probabilistic significance. Only 
the effective action — and no other cost function, such as ( p. 12 ) above — will even have as its 



minimum the correct mean value. The effective action thus plays the role of a "fluctuation 
potential" in the theory of empirical ensemble averages constructed from independent samples, 
analogous to the Onsager-Machlup action for weak-noise or for linear systems. In fact, the 
effective action is known to coincide with the Onsager-Machlup action for weak-noise or for 
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linear systems |L5|. Thus, the optimal estimator proposed in this work coincides with the 
standard ones for those special cases. 

Unlike ( p.8| ), the effective action cannot generally be written as an explicit function of 
the state histories. However, it has been shown in [11]-[13] that it may be calculated by a 



constrained variational method. The Euler-Lagrange equations that result are a pair of forward 
and backward equations, very similar to the Kushner-Stratonovich-Pardoux (KSP) equations. 
Despite this, the variational estimator is not quite equivalent to the optimal estimator which 
follows from the KSP equations. It possesses instead a property that we call mean-optimality. 
Although somewhat weaker than the optimality enjoyed by KSP, mean-optimality distinguishes 
it from other "suboptimal" estimators which have in fact no optimality whatsoever. However, 
the main advance of the variational approach is in the problem of constructing finite-dimensional 
approximations. Because the variational estimator is based upon an Euler-Lagrange variational 
principle, it is very easy to develop consistent approximations by a Rayleigh-Ritz scheme. In 
this method, parameterized trial functions are selected to represent the solutions of the forward- 
backward equations. Inserted into the variational functional and varying over parameters, one 
obtains approximations to the exact forward-backward equations and thereby to the effective 
action. A straightforward use of this scheme in fact leads to a moment-closure approximation 
for the forward filtering equation, much like that originally proposed by Kushner ||. However, 
now also backward equations are obtained for the smoothing problem. Necessary consistency 
properties with the forward equations are guaranteed by the fact that these arise together as 
the Euler-Lagrange system of an approximate action functional. 

This paper is organized as follows: in Part I, we present our variational formulation of 
optimal estimation. We first explain the unique statistical significance of the effective action, 
which makes it appropriate for variational estimation. We review there also the definition 
and properties of the effective action, including the notions of joint and conditional effective 
actions. The optimality property will be established for the variational estimator and compared 
with that of the KSP estimator. We next discuss how to calculate the effective action based 



upon its variational characterization. An iterative numerical scheme is outlined to numerically 
calculate the variational estimator, which reduces to solving KSP-type equations. Some matters 
important for practical applications will finally be discussed: the case when measurements are 
taken, not continuously, but at a discrete set of times, and the evaluation of the ensemble 
dispersion around the most probable value of the sample mean. In Part II the very important 
issue is addressed of constructing finite-dimensional approximations to the variational estimator, 
crucial for application of the methods to spatially-extended (or distributed) systems with many 
degrees of freedom. A Rayleigh-Ritz moment-closure scheme is developed, based upon the finite- 
dimensional reduction of the nonequilibrium action. The use of this approximation scheme for 
solution of practical estimation problems is finally discussed. 

I Variational Formulation of Optimal Estimation 

I.l. Ensemble Theory of Estimation 

There are intrinsic limits to our ability to estimate, which can be understood most simply from 
an ensemble point of view. If the stochastic dynamics fl0.1| ) is run many times with different 
realizations of the noise or, even in the deterministic case D(t) = 0, if the initial data are 
selected randomly from some starting distribution then the solutions will be generally 

quite distinct. Thus, in N different trials there will be N different outcomes Xi(i), ...,Xtv(£)- 
It is therefore not obviously very meaningful to give a single value x*(i) as an estimate of the 
state given some partial information 1Z (unless, of course, that information included the exact 
initial data or realization of the random noise!) It is true that the average over samples will 
converge to the mean in the ensemble conditioned on the available information: 



However, the individual sample points will show a scatter, possibly quite large, about this mean 




n=l 
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value. A useful measure of this scatter is the covariance matrix 

C n (t) := E[5X(t)6X. T (t)\K] (1.2) 

in the ensemble conditioned on 1Z, where 5X(t) := X(i) — E[K(t)\TZ]. In particular, TrC-ft.(i) := 
-E[||(5X(t)|| 2 |7*!.] gives the mean square radius cr\(t) of scatter of the sample points around the 
mean. The ensemble mean has the one virtue that it minimizes this rms radius of scatter. In 
other words, if one took <5X(i) := X(i) — x*(i) for any other non-random estimator x*(i) 7^ 
£7[X(t)|7£], one would increase cr%(t). This is an elementary fact of probability theory: for any 
random variable, the expectation value is the unique deterministic estimator for which the mean- 
square error is a minimum. This important property of the mean value as a predictor — that it 
minimizes rms forecast error — has been emphasized before by Leith in the field of climatology 
pi| . Of course, the above considerations show that one should have not only an estimate of 
the state of a system, but also an estimate of the reliability or certainty of that state. The 
covariance matrix C-ji(t) is a good such measure. Any state within a few standard deviations 
(Tu{t) of the mean must be regarded as having a good degree of probability to occur. 

Such considerations are precisely those which justify the standard Bayesian approach of 
Kushner-Stratonovich-Pardoux. Granted the limitations implied above, one cannot do better 
than to give the probability density "P(x, t\R) of the state variable conditioned on the available 
information. The minimal requirement on a variational approach to estimation is thus that it 
should give at least the mean and covariance of such conditioned ensembles. This has motivated 
us to consider a cost function, the "effective action" , which has a proper foundation in the theory 
of empirical ensembles. A brief review of its definition and basic properties is here required. 
1.2. Basic Theory of the Effective Action 



The quantity which we have termed the effective action [11]- [13] has appeared, in various 
guises and by various names, in quantum field theory, in theory of stochastic processes, and in 
dynamical systems theory. We shall here just briefly recapitulate its definition and properties. 
One interpretation of the effective action is as a generating functional for multi-time cor- 



relations. This is the way in which the functional is generally introduced in field theory |17| . 
Consider any vector- valued random process Z(i). Then, the cumulant generating functional 
Wz[h] is defined as 

W z [h] = log(exp (jf*' dt h T (i)Z(t))>. (1.3) 

The nth-order multi-time cumulants of Z(t) are obtained from W^[h] by functional differenti- 
ation with respect to the "test history" h(i): 

S n W z [h] 



Cii---in (^1 ' ■••) tn) 



(1.4) 

h=0 
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It is not hard to check from its definition ( |Q| ) that W^[h] is a convex functional of h. The 
Legendre dual of this functional is defined to be the effective action of Z(i): 

r z [z]=max{<h,z>-iy z [h]}, (1.5) 

h 

with < h, z >:= J dt h T (t)z(i). It is a generating functional of so-called irreducible correlation 
functions of Z(i): 

(1.6) 



r (t t ) — 6nTz[z] 



Sz^ih) ■ ■ ■ 5z in (t n ) z= s' 

The functional derivatives here are evaluated at the mean history z(t) := (Z(i)). It is not hard 
to check from the definition ( ]I.5[ ) that Tz[z] is a convex, nonnegative functional with a unique 
global minimum (equal to zero) at the mean history z = z. 

The effective action has another important interpretation as the rate function in the theory 
of large deviations of empirical ensemble averages for time-series. See pl| for the original, so- 



called Cramer theory of single real variables and IS] for the extension to general vector spaces. 



This theory involves the empirical or sample mean: 

1 N 

Z N (t):=^Y, Z n(t), (1-7) 

n=l 

where Z n (i) for n = 1, ...,N are independent, identically distributed realizations of the random 
process Z(t). The law of large numbers states that, in the limit of number of samples N going 
to infinity, Zj\r(i) - ► z(t). However, for finite N, Zjy(t) is itself a random process with some 
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probability of achieving a fluctuation value z(t) different from the ensemble mean z{t). The 
basic result of the Cramer theory is that this probability decreases exponentially in the limit 
as N — ► oo: 

P (z N (t) « z(t) : ti < t < t f ) ~ exp (-JV • r z [z]) . (1.8) 

Thus, r z [z] for z 7^ z gives the rate of decay of the probability to observe Ztv ~ z. Since 
r 'z [z] = only for z = z, the probability to observe the empirical iV-sample mean Z n equal to 
anything other than the ensemble mean z must go to zero as iV — > oo. Thus, the large deviation 
result ( p8|) includes, and generalizes, the usual law of large numbers. Furthermore, the ensemble 
mean is now also seen to be characterized by a variational principle of least effective action. 
That is, the most probable value of the sample mean for large N, or z = z, is just that which 
minimizes the effective action Tz[z\. It is worth emphasizing that the effective action is the 
unique function possessing all of these properties. This is a consequence of a general theorem 
on uniqueness of rate functions for large deviations |20t| . 

There is one other general theorem of large deviation theory which will prove important to 
us in the sequel. This is the so-called Contraction Principle. Suppose that Wjy(i) is a random 
process which is defined as a continuous functional W of an empirical mean Zjy(i): 

W N (t):=W[t;Z N }. (1.9) 

Then, Wjv(i) also satisfies a large deviations principle with rate function given by 

fV[w]= min T z [z}. (1.10) 

{z: W [zj=w| 

See p0[] . When the functional W is linear, then obviously fV[ w ] = rjy[w]. 

The joint effective action Fxy[x,y] of a pair of random time series X(t) and Y(t) can be 
defined most simply as the effective action Tz[z] of the composite vector Z(f) := (X(£), "Y(t)). 
Of course, this notion may be extended to a joint effective action Tx 1 -x n [xi, x n ] of an 
arbitrary number n of variables Xj(t), i = 1, ...,n. It is a simple application of the Contraction 
Principle to see that elimination of one of the variables is accomplished by minimizing over its 
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possible values. For example, 

r x [x] = minify [x,y] (1.11) 

recovers the effective action Tx[x] of X(t) alone. 

The joint effective action of a pair of independent time series X(t) and Y(i) is easily shown 
from the definition to be given just by the sum: 

r Xi y[x, y ] =r x [x] + ry[y]. (1.12) 

In general, for dependent time series, one may define the notion of a conditional effective action 
by means of 

r X | y [x|y] :=rXy[x,y]-IY[y]. (1.13) 

Thus, when X(t) and Y(t) are independent, Tx|y[x|y] = Tx[x]. The term "conditional action" 
is justified by the relation to large-TV asymptotics of conditional probabilities for empirical 
averages: 

P (Xjy ~ x|Yjv = y) ~ exp (-JV • r X | y [x|y]) . (1.14) 
Using the definition of the conditional probability, (1.14) is a simple consequence of the ba- 



sic large deviation estimate (1.8). The conditional action is also a generating functional for 



irreducible multi-time correlation functions in the conditioned ensemble, in the limit N — > oo. 
1.3. Proposal for the Variational Estimator 

These considerations motivate our following proposal: we propose to take as optimal estima- 
tor x*[r] the minimizer of the conditional effective action Tx\n[x\r] of the history x given the 
current observation history {r(i) : t E [t{,tf]}. From our discussion of the effective action in 
the previous section, we can infer the crucial property of this estimator: it is the mean value 
within the subensemble in which the empirical iV-sample average takes on the value r, that is, 
the sub-ensemble in which 

rjv(i) = r(i), t G [ti,t f \. (1.15) 

In fact, as discussed above, the conditional effective action is a variational functional whose 
minimum coincides with the subensemble mean history for an arbitrary set of constraints on 
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the empirical sample average. Thus, the estimator x*[r] is exactly of the form i£[X(f)|7£] 
discussed above, with TZ specified by ( p. 15 ) . It is also possible to obtain from the conditional 



effective action the error covariance C*[i|r], essentially by evaluating its Hessian matrix (see 
Section 1.6). This is precisely the ensemble dispersion that would be observed via the spread 
of sample histories in an ensemble forecasting scheme pl| , if the ensemble considered were the 
one specified by the condition ( p. 15 ) for the limit of large N. 



Our proposal is clearly similar in spirit to the Bayesian formulation of Kushner-Stratonovich- 
Pardoux. However, they are distinct. The difference can best be understood by considering the 
problem from an experimental point of view. Suppose that a very large number N of samples 
of the system (1) are run, drawing initial conditions randomly from the same distribution 
Vq. Then the conditional distribution V(x, t\lZ(tf)) considered by KSP corresponds to the 
very small sub-ensemble in which that particular realization r (non-random) of the observation 
history occurred. It obviously difficult to prepare such sub-ensembles, since one must wait 
patiently for the particular observation r to spontaneously occur and, each time it does, add it 
as a member to the sub-ensemble. This makes it very difficult to directly test the predictions 
of the KSP-equations. More to the point, it is very difficult to carry out an ensemble or 
Monte Carlo approach to calculate directly the conditional average. The variational estimation 
method proposed above corresponds to a different — and somewhat larger — sub-ensemble. As 
noted above, it corresponds to considering the sub-ensemble specified by the condition {rjv(i) = 
r(i), t £ [ti,tf]}. It is clear that the sub-ensemble described by the KSP-equations is, also, a 
subset of the new, larger one. In fact, if it is true as in the KSP sub-ensemble that r n = r in 
every realization, n = 1, N then it is a fortiori true that r^r = r. However, there will clearly 
be many members of the new ensemble in which Tn = r but for which not every term r n of 
the iV-sample average is equal to r. Thus, the new sub-ensemble is clearly much larger than 
that considered by Kushner-Stratonovich-Pardoux, but, still, too small a subset of the whole 
ensemble to be reproducible by direct methods. 

Despite the difficulty of directly testing the KSP-equations, they are truly the optimal 
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method for the filtering and smoothing problems. While it is difficult to prepare the conditional 
ensemble corresponding to a given observation r, there is no difficulty in preparing one member 
of such an ensemble. After all, just running the system once and collecting one observation 
r provides one realization in which that particular observation occurs! In fact, it is exactly 
this type of situation which occurs in practical prediction problems, such a meteorology. One 
has no control over which particular weather pattern will be observed up to today, but, given 
the one that has occurred, one would like to predict tomorrow's weather. The best predictor 
will be that corresponding to the conditional ensemble in which all of the available information 
is used. The variational predictor we have proposed corresponds to a larger sub-ensemble, 
which means that somewhat less detailed information about the system is used in making the 
prediction. Therefore, the variational predictor is optimal, but in a somewhat weaker sense than 
the KSP one. The variational estimator x* [r] is optimal given the data just on empirical sample 
averages, which is somewhat less than the information one actually possesses. We shall refer 
to the weaker optimality property possessed by the variational estimator as mean-optimality. 
In the language of statistical physics, the variational estimator could be termed a "mean- field 
approximation" to the optimal one, since it exploits conditions defined only through the sample 
mean. How different as predictors are the variational and KSP optimal estimators will depend 
upon how much variation occurs with the larger sub-ensemble. If all the terms r n ~ r in the 
sample average whenever tn = r for a given r, then there will be little difference between the 
two subensembles. This may be expected to occur whenever there are "preferred paths" in the 
dynamical evolution. 

In the linear case, the variational method proposed above coincides with the standard one 
described in the Introduction. To see this is true it is enough to point out the well-known 
fact that, for a linear dynamics, the effective action fx[x] coincides with the Onsager-Machlup 



action [15]. Hence, in the linear case, the variational estimator coincides with the optimal 
estimator of Kushner-Stratonovich-Pardoux, so far as the problems of prediction and filtering 
are concerned. This will not be true in general for nonlinear systems. As we shall see in the 
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next subsection, there is nevertheless a close formal connection between the Bayesian approach 
of Kushner-Stratonovich-Pardoux and the variational approach. 
1.4 Calculation of Effective Action & Variational Estimator 

It remains to consider how the effective action and its minimizer, the variational estimator, may 
actually be calculated. It was shown in [11, 13] that the effective action may be obtained from 



a constrained variational formulation. We shall here briefly review the results of those works 
and then explain how to obtain the minimizing history x* [r] itself. 

Consider any Markov times series X(£) and Z(t) := Z(X.(t),t) given by a continuous func- 
tion -Z(x, t). In this general context there is a useful variational characterization of the effective 
action Tz[z\. To explain this result, we must introduce a few notations. Because the process 
X(t) is Markov, its distribution V(x, t) at time t is governed by the forward Kolmogorov equation 

d t V(x,t) =L(t)V(x,t), (1.16) 

with L(t) the instantaneous Markov generator. The diffusion process governed by the stochastic 
equation (|0.1|) is a particular example, for which the generator is the Fokker-Planck operator 
defined in fl0.4| ). Observables, or random variables, -4(x, t) evolve under the corresponding 
backward Kolmogorov equation 

d t A(yi, t) = -L*{t)A{x, t), (1.17) 

in which L*(t) is the adjoint operator of L{t) with respect to the canonical bilinear form on 
L°° x L , i.e. < A,V >: = Jdx A(x)V('k). The backward and forward Kolmogorov equations 
may be simultaneously obtained as Euler-Lagrange equations for stationarity of the action 
functional 

T[A,V] := f tf dt < A(t), (d t - L(t))T(t) > (1.18) 
when varied over V € L 1 with initial condition V{ti) = Vq and A G L°° with final condition 

A(t f ) = l. 
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For the above situation, the effective action of Z(i) := Z(X.(t),t) has been shown |TT|, [l3|] 
to be obtained by a constrained variation of the action r[.4, V]- In fact, 

r z [z] = st.pt. AiV T[A,V] (1.19) 

when varied over the same classes as above, but subject to constraints of fixed overlap 

<A(t),V(t) >= 1 (1.20) 

and fixed expectation 

< A(t),Z(t)V(t) >=z(t) (1.21) 

for all £ G Note that Z(t) is used to denote the operator (in both L 1 and L°° ) of 

multiplication by Z(x, i). The Euler-Lagrange equations for this constrained variation may be 
obtained by incorporating the expectation constraint fll.21 ) with a Lagrange multiplier h(i). 



The overlap constraint could also be imposed with a Lagrange multiplier w(t). However, it 
turns out to be preferable to impose it through the definitions 

A(t) = l + [B(t)-(B(t)) t ] 

:= l + C(t), (1.22) 

with the final conditions B(tf) = C(tf) = 0. Note that (B(t))t ■=< B(t),V(t) > is the expecta- 
tion with respect to the distribution V{t). Hence, the overlap constraint ( |L20 ) is satisfied when 



B(t) is varied independently of V(t). Like A(t), the variable C(t) is not independent of V(t), 
but must satisfy the orthogonality condition < C(t),V(t) >= 0. We shall mostly make use here 
of the original variable A(t) rather than C(t), but the latter will play an important role in our 
formulation of moment-closures in Part II. 

Although obtained by varying over B(t),V(t), the Euler-Lagrange equations are most use- 
fully written instead in terms of the original variables ^4(t),"P(t): 

dtP(t) = L(t)V(t) + h T (t)[Z(t) - (Z(t)) t ]V(t) (1.23) 
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and 

8 t A(t) + L*(t)A(t) + h T (t)[Z(t) - (Z(t)) t ]A(t) = 0. (1.24) 

The calculation via B(t) has allowed the Lagrange multiplier to be evaluated explicitly, as 
w(t) = h T (t)(Z(t))t- The effective action r^[z] evaluated at a specific history z(i) is now 
obtained from the solutions of ( p.23 ),( L24| ) by substituting them back into the action functional 



r[.A, in ( p.!8[ ), when the "control field" h(i) is chosen so that ( JI.21[ ) reproduces the considered 
history z(i). It is not accidental that the same notation h(i) was chosen above as for the 
argument of the cumulant generating functional W^[h]. In fact, it can be shown that also 

W z [h] = f tf dth T (t)(Z(t)) t , (1.25) 

using just the solution V(t) of the forward equation ( |L23| ) for the control history h(t) which 
appears as the argument of Wz- For more details, see |ll|, |13|] . It should not have escaped 
the attention of the reader that the forward equation ( I.23| ) is very similar to the Kushner- 



Stratonovich equation ( p.3| ) for the conditional distribution V*(t) = V(t\TZ(t)) and that the 
backward equation ( |1.24 ) is likewise similar to the Kushner-Pardoux equation ( p.7| ) for A.*(t) = 



V(t\lZ(tf))/V(t\lZ(t)). This observation will be developed below. (See also Appendix 1.) 

Having completed our review of established results, we now consider how to calculate the 
variational estimator. It is helpful to observe that the minimizer x*[r] of rx|fl[x|r] over x 
with r fixed is the same as of Tx,r[x, r], the joint action of x and r. For simplicity, the 
observation errors will be assumed to be white-noise in time and independent of the dynamical 
noise. Another important simplifying assumption we shall make here is that the function of the 
process which is observed is linear: 

Z(x,t) = B(t)x. (1.26) 

We postpone to later the consideration of the general case, which is somewhat more complicated 
but no different in principle. By our assumptions, the joint action is given as 

r x ^[x,r] = r x [x] + ~ f f dt [r(t) - B(t)x(t)] T R- 1 (t)[r(t) - B(t)x(t)]. (1.27) 
2 J tl 
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The second term is r#[p] given in ( |0.9| ). We abbreviate r*[x] := Tx,r[*-, r] and its functional 

jr. I 

<5x(i) I 



derivative as k*[t;x] = ^ffy[x]. It is an easy calculation, using the expression ( |1.27| ), to show 



that 

k*[i;x] =k[t;x] +B T (t)R- 1 (t)[B(t)x(t) -r(t)], (1.28) 

with k[t;x] := [x] . Observe that we are using here the notation k(t) for the control 
associated to X(t), whereas we reserve h(t) for the control field associated to Z(i). What 
makes finding the minimizer x*[r] less trivial is the fact that Tx[x] and k[t; x] are not calculable 



directly, but only as the result of another optimization problem, like that in Eq.(1.5): fx[x 



maxk {< x, k > — Wx[k]} . Thus, the problem to be solved is really of minimax type: 

r*[x*[r]] = mmmaxjT^r - Bx]+ < x, k > -W x \k]} • (1-29) 

Numerical schemes to obtain the minimizer x*[r] must thus address this minimax problem. 
The simplest approach conceptually is to reformulate it as a double minimization, i.e. 

rv[x*[r]] = min jr fl [r - Bx] - mm{W x [k}- < x, k >}| . (1.30) 

In this case, each of the minimizations may be carried out in nested fashion, via any of the 
common iterative methods. For example, a conjugate gradient (CG) algorithm applied to the 
outer problem will produce a sequence x^ n ) converging as n — > oo to, at least, a local minimum 
x* of r*[x]. We mention conjugate gradient only as an example of an iterative scheme to 
find the minimum of a convex function, which requires as its input at each step the gradient 
ki"^(t) = ,5x(f) [ x ^]- Any such scheme requiring the gradient might be used instead. From 
( I.28[ ) such algorithms require knowing k[x^ n ^]. Conveniently, this is exactly what is obtained 



from the solution of the inner problem, since k[x( n )] is the unique minimizer k( n ) of the convex 
functional W^[k] := W*[k]- < k, X ( n ) > . This inner minimization problem may also be 
attacked by a CG-type method, noting that the gradient is 

— [ k ]=x[t;k]-xW(i). (1.31) 
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This gradient is now directly calculable via formula ([.21 ) above for a given k(f). Each evaluation 



of x[k] by ( |I. 21 ) requires one forward and one backward integration over the time interval [t$, tf]. 



A CG-type method applied to [k] will then produce a sequence k( n < m ) which converges to 
y-(n) = kJxC")] as m ^ oo. This inner minimization thus provides the gradient k^ n ) required for 
the nth CG step of the first minimization. To initiate the algorithm, one must specify x(°) and 
k(°.°). For this purpose, one may, for example, set A = 1 as a first approximation in ( |L21[ ). 
This gives 

xW(t) = (X(t)> t (1.32) 
and, from the equation k„,[£;x(°)] = 0, the first guess 

k (o,o) (t) = B T ( t ) R -i ( t ) [ r ( t ) _ B(t)x(t)] (1.33) 



If ( p.33j ) is substituted into the forward equation ( I.23| ), the latter may be integrated with 



sequential input of the observations r(t). Thence, both x(°) and k^ 0,0 ) are determined. At 
each successive stage one may take k( n+1 '°) = k( n ) to find the gradient k( n+1 ) for the (n + l)st 
CG step. This entire procedure can be regarded as a nonlinear generalization of the "sweep 
method" pi used to find the minimizer of the Onsager-Machup action ( |0.8| ). 

While this method has the advantage of conceptual simplicity, it suffers numerically from 
loss of precision and computational inefficiency. It is well-known in numerical optimization 
that minimizers are in general obtained to only half the precision of the minimum values them- 
selves. As it is the outside minimizer which is of direct interest here, the double minimization 
algorithm requires working in a precision quadruple to that desired for the optimizing history. 
Furthermore, the nested algorithm requires the square of the number of iterations as for a single 
minimization. It is thus advantageous to reformulate the minimax problem in terms of a single 
numerical minimization. This can be easily accomplished by rewriting it as 

r*[x*[r]] = min^r - Bx[k])]+ < x[k],k > -W x [k]} . (1.34) 

k 



(We thank M. Anitescu for this observation.) Note again that x[t; k] is given directly by (1.21) 



via one integration each of the forward and backward Kolmogorov equations over the time 
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interval t/]. The result of this single minimization is a control field k*[r], which then yields 
the desired optimal history x*[r] as x[k*[r]]. The only disadvantage of this formulation is that 
the gradient of the functional in brackets in ( I.34| ), 



G x [k,r] :=r fl [r-Bx[k]]+ <x[k],k> -W x [k], (1.35) 



is 

5G X n , /■*' , , 5x 



Kr] = j f dt' ^y[t';k] [k(0+R-V)(B(Ox[^k]-r(i'))] • (1.36) 



<5k(i) L ' J 4 «5k(i)' 
This expression involves 



* x 62 Wx rui /TQ7^ 



<5k(t) L ' J *k(t)<Sk(tf) 1 
the Hessian of the dual functional Wx[k]. Thus, this 2nd-derivative must be evaluated and 
stored for use. The storage issue is nontrivial for spatially-extended or distributed systems, 
because the Hessian then involves a number of elements of the order of the spacetime grid 
squared. However, these problems can be overcome. First, there are efficient direct and adjoint 
algorithms for calculating higher-order derivatives, such as Hessians, in addition to those for 
first derivatives. For example, see p3[ , Chapter 7, and also |13[| . Second, it is not really the 
Hessian itself which must be stored but only its matrix products with certain vectors, those 
in ( I.36| ). Hence, storage requirements can be reduced in intelligent schemes to vectors of the 



same order as required for the double minimization algorithm. We give further details of such 
algorithms elsewhere, which we regard as the most promising numerical implementations of our 
estimation method. 

Whichever of these iterative optimization methods is employed, T* [x] is a convex functional, 
and the iterates will therefore converge to the global minimizer x* [r] . Observe that the zeroth- 
order of the double iteration scheme coincides formally with the KSP equations ( p.3[ )-(|0?^). In 
fact, it is then easy to see that equation ( |I.33| ) for k(t) at zeroth-order reduces to 

k(t) =B T (t)h(t), (1.38) 
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with h(i) given precisely by ( |0.5| ). Substituting this value, the forward-backward equations 
in our iterative scheme reduce in form to the KSP equations ( |0.3] ),(|0"7^). In general, there is 
no reason to believe (except for linear dynamics), that the variational filter and KS filter will 
coincide. However, one may hope that the variational estimator, acting as a filter, is not too far 
from the optimal KS filter. The formal coincidence of these two in the case of linear observations 
at the start of the iterative construction provides possibly a convenient algorithmic approach 
to assess the differences. We emphasize, however, the word "formal" in this context, because 
the variational equations ( |1.23| ), (1.24), while appearing in form identical to the KSP equations 



j|),( |0.7| ), have a quite different mathematical interpretation. Whereas the control field h(i) in 
the variational equations is non-random, the KSP equations are stochastic PDE's. In particular, 
the numerical discretization schemes appropriate to the two mathematical interpretations are 
quite different and lead to quantitatively distinct results. This will be discussed in more detail 
below for the case of discrete-time measurements. 

When the measured function Z(x,t) is nonlinear in x, then our approach must be slightly 
generalized. In this case, we consider the joint action Tx,z,r\*-, z, r], whose minimum over x, z 
with r fixed yields the optimum state estimate x* [r] and also the optimum value of the measured 
variable z*[r]. The advantage to considering this joint action is that it is simply expressed in 
terms of the effective action r#[p] of the observation error, which is still assumed independent 
but not necessarily Gaussian. Indeed, a simple calculation in this case gives 

r^,fl[x,z,r] =r XiZ [x,z] +T r [t-z]. (1.39) 

In contrast, the joint action Tx,r[x-, r] does not have such a simple expression, but instead must 
be calculated via the Contraction Principle as r^^fx, r] = min z Tx.z,r[ x , z, r]. In the case of a 
linear observed variable, Z(x,t) = B(i)x, the joint action Tx,z[ x , z ] is found to be 

[ r x [xl if z = Bx 
r x , z [x,z] = <^ (1.40) 

I +oo otherwise 

Hence I\x-,fl[x, r] = rj[x] + T^[r — Bx] and the estimation strategy we have proposed for a 
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nonlinear measurement function reduces to the earlier one in the linear case. 

The minimization of Fx,z,r[x, z, r] over x, z may be done in two steps, which can be carried 
out independently. These are, first, to minimize 

r Z) j i [z,r] = rzN + r Ji [r-z]. (1.41) 

over all z at fixed r, and, second, to minimize rx,z[ x i z ] over all x with z fixed. From the 
solutions of these two problems, z*[r] and x*[z], respectively, the final variational estimator of 
the state of the system is then the obtained as the composition x* [r] = x* [z* [r]] . The equivalence 
of this two-step formulation with the direct one is an application of the Contraction Principle. 
Clearly, minimizing rj^[x, r] over all x can be achieved by minimizing first Tx,z,n[ x , z , r ] 
over all x with z fixed, and then by minimizing over all z. The minimization over x yields 
the joint effective action of z and r, since Tz t n[z, r] = min x Tx,z,r[ x , z , r ] by the Contraction 
Principle. The minimum is achieved here for some x*[z], the optimal state history x for a given 
z-history. There is no dependence upon r. To see this, observe that the minimization may be 
directly carried out in equation ( I.39| ), with the result that r^^[z,r] is given by ( [I.41 ). The 



Contraction Principle has been employed again to infer Tz[z] = min x Tx,z[x, z]. It is from this 
minimization that x* [z] is determined, which therefore cannot involve r. All of the dependence 



upon measurements is now isolated in (1.41), whose minimization over z yields z*[r]. 

This first minimization of Tz,r[z, r] over z is a problem of the same type as for the case of 
linear measurement functions discussed in the text. As there, a CG-type method applied to 
r*[z] := Tz,r{z>i r] may be employed to calculate z*[r], based upon the Legendre dual relations 

5T 7 5W 7 

hM = H ^[.],.[* i h] = a ^M. <" 2 > 

Any of the algorithms discussed in the text may be employed. For example, in the double 
minimization scheme, the gradient for the outer minimization, 

= h[ t ;z( n )]+R-i(t)[zW(t)-r(t)], (1.43) 
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would be obtained from an inner one. The iteration could be initiated by 

z(°)(t) = (Z(t)) t (1.44) 

and 

h^°Ht) = K- 1 (t)[r(t)-(Z(t)) t ]. (1.45) 

Just as before — but now quite in general — the zeroth-order control h^°'°\t), when substituted 
into the forward-backward equations flL23|) , ( |1.24 ) recovers formally the KSP equations. 



The second minimization of ITx z[x, z] over x is similar. Note that 

r x ,z[x,z] = max{< k,x > + < h,z > - W x ,z[^M) • ( L4 6) 

k,h 

Hence, the problem 

rx,z[x*[z],z] = minmax{< k,x > + < h, z > -W x ,z[k,h]} (1.47) 

x k,h 

is again of minimax type. A doubly iterative scheme would therefore carry out the maximization 
over k, h at fixed x, z to obtain not only T Xi z[x, z] but also the gradients 

k[i;x,z] = ^y[x,z], h[i;x,z] = ^[x,z] (1.48) 

that are used in the next minimization over x (at fixed z). Alternatively, one may solve this 
problem as before via a single minimization over k, h of a functional 

Gx,z[k,h] :=< k,x[k,h] > + < h,z[k,h] > -W xz [k,h] (1.49) 

but with the difference that this minimization is now subject to a nonlinear constraint that 

z[t;k,h] =z(i), te[ti,t f }. (1.50) 

This may be addressed using algorithms from nonlinear programming or stochastic/ quasi- 
random methods. 
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1.5. Estimation with Discrete-Time Data 

So far, we have considered the case where the measurements employed in our estimation are 
taken continuously in time. However, this can only be an idealization of a situation where 
the data are obtained at a discrete series of times. In many practical examples, the instants 
of measurement will be so widely separated that the idealization of continuous acquisition is 
far from valid. It is thus a very practical concern to address the issue of state estimation of 
continuous in time, stochastic dynamical systems such as ( p.l|) based upon discrete-time data. 
In addition, we shall find that some fundamental new concepts are required that are important 
in other contexts. For example, the calculation of ensemble dispersions at an instant of time 
will turn out to be closely related to the problem of estimation with discrete-time data. 

The only change in the statement of the problem in the Introduction is that now the mea- 
surements are of the form 

r k = Z(x(t k ),t k ) + p k , k = l,...,n (1.51) 

where p k represents a measurement error with covariance If the measurement error is taken 
to be an independent Gaussian at each time t k , then the cost function for the observations is 

1 n 

r «b] = ^E^ R fcV fc , (1-52) 
z k=i 

where the sum includes all of the observation times ti,...,t n up to the present time. The 
combined cost function T*[z] := r^^[z,r] for the estimation is then, analogous to ( |I.41| ), 

1 n 

T*[z] = T z [z] + - 5> fe - z(t k )] T K^[r k - z(t k )}. (1.53) 
1 k=l 

For simplicity, we shall only consider here the problem of estimating the optimal z-history. 
As discussed in the previous section, there remains the problem of estimating the optimal 
state or x-history, given the z-history. This can be handled in the same way as discussed 
there. Alternatively, we might formulate the problem as a direct estimation of x. The changes 
necessary to our discussion below should be obvious to the reader. If we seek the minimizer of 
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( 1.53 ), we must satisfy 



AT n 
= N = h[t;z] + £ Rfe 1 ^^) - r k ]6(t - t k ). (1.54) 

Thus, we see that h[t;z*] for the optimal z*[r] must be a sum of delta functions at the obser- 
vation times. This suggests that we consider only the estimation of z at the observation times. 
In fact, we will see that this suffices. 

The cost function H*(zi, z n ) := H Zj r{zx, z n ; rx, ...r n ) for estimating z& := z(t k ), k = 
1, ...,n is obtained in the following way. First, we define a cumulant generating function 

n 

F z (Ai,...,A„) :=log(exp[^AjZ(t fc )]). (1.55) 

k=l 

This is entirely analogous to the cumulant generating functional W^[h] defined in subsection 
2.1. In fact, they are equal with 

n 

h(t) = J2^5(t-t k ). (1.56) 

k=l 

The Legendre transform of F z is the dynamical part of the cost function: 

H z (z 1 ,...,z n ) = max i ^ zjA fc - F z (Xi, A„) > . (1.57) 
Ai,..,A„ [ fc=1 J 

This quantity is called the multitime (relative) entropy. It may also be obtained via the Con- 
traction Principle directly from the effective action through a constrained minimization: 

H z (z 1 ,...,z n ) = min T z [z]. (1.58) 

{z:z(i fc )=z fc ,fc=l,...,n} 

The combined cost function is then 

1 n 

fl*(zi,...,z n ) = fl0(zi,...,z n ) + - 5^[r fe - z fc ] R^ 1 ^ - z fc ]. (1.59) 

1 k=i 

Its minimization yields the optimal values of z(ti), ...,z(t„). The condition for the minimum is 

Afc = R4 1 [r fc -z fc ], (1.60) 



which can already be inferred from ( p. 54 ), (1.56). We may regard ( |I.60| ) as a nonlinear equation 
for either the A's or the z's. 
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To calculate numerically the cost function Hz(zi, • z n ) we see that we must integrate 
the forward and backward equations ( [1, 23 ), (1. 24) with a control field h(i) consisting of delta- 



function spikes, as in (L5q). It is easiest to formulate this integration in terms of suitable 



jump conditions at the observation times. That is, we may integrate the ordinary forward and 
backward Kolmogorov equations ( |1.16j ),( [LT7 ) with h = between the observation times but 



make discrete jumps at those times. We shall show that the proper jump conditions are simply 

e Afc Z(*,t k ) 

V(x, t k +) = vyfa _ ) P(x, t k -), (1.61) 



and 



Here we defined 



A ^ **-) = w(t k -) " 4(x ' tk+) - (L62) 



W(t k -) := J dx e X * Z ^V(x,t k -) = (e X ^ Z ^) t ^, (1.63) 

so that division by that factor guarantees proper normalization of the results after the jump. 

We prove now the validity of these jump conditions. For the first, it is useful to make a 
reformulation of the forward equation ( |I,23 ). The same solution found for that equation may 



be obtained by solving instead 

d t Q(t) = L(t)Q(t) + h T (t)Z(t) ■ Q(t) (1.64) 
and then renormalizing subsequently 

(L65) 

with 

Af(t) := J dx Q(x,t). (1.66) 

It is not hard, by differentiating ( I.65| ) with respect to time, to show that V(t) so-defined 



satisfies ( I.23j ). This is actually a standard device to solve the Kushner-Stratonovich equation. 
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In that context, the analogue of equation (1.64) is called the Zakai equation ^6|. With the 
delta-function control field, we obtain 

dt In Q(t) = + E \ k Z(t k )5(t - t k ). (1.67) 

W W fc=i 

We then integrate in time over the range {tk — e, tk + e) and take the limit as e — > 0. The first 
term on the righthand side is continuous and does not contribute. From the delta function 
contribution we easily obtain 

Q(M*+) = (I68) 

Renormalizing Q(x, ^+), we recover (|I.6lD , as claimed. 

The second jump condition can be similarly obtained. The backward equation ( |I.24| ) must 
likewise be rewritten so that the source terms stand alone before integration. In fact, by 
integrating ( |I.64| ) over x, one finds that 

j t lnM{t) = h T {t)(Z{t)) t . (1.69) 

Using this result, one easily derives from (|L24) that 



Let us now consider the case where h(i) is given by ( |I.56| ), as a sum of delta- functions. Inte- 
grating ( |I. 70 ) over the range {tk — e,tk + e) and taking the limit as e — > 0, then yields 

A{x,t k -)/AT{tk-) ' v ' ; 

Of course, M{t) itself experiences a jump across the observation time t k , changing as we have 
seen by the ratio 

^|±l = (e A^( tfc))tfc _ = w(tfc _ ) . (L72) 

This is a direct consequence of (jLgp . From (F7l|) and ( p72|) , the second jump condition 
immediately follows. 
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Using the jump conditions (1.61) and (|I.62|) to replace the controlled forward and backward 



equations ( I.23| ),( p2^ ), the calculation of the cost function proceeds as follows. Integrating 



(1.69) over the time-interval and comparing with ( I.25| ), we see that 



Fz(A 1 ,...,A„) = log.A%). (1.73) 

By writing N[tf) = Hk=i jv"fa-) ( wnere N(ti — ) = 1 was used), we can decompose this into a 
sum of contributions for each time tk 

n 

F z (X l ,...,Xn) = X)(AF) fc (Ai,...,A fc ) (1.74) 

fe=i 

with 

(AF) fc (Ai,...,A fc ) := log(e A ^)) tfc _ =logW(t fc -). (1.75) 

Whereas the dependence upon A^ is explicit, note that the dependence upon the remaining 
variables Ai, A^_i is only implicit through V(x., tk — )- Having determined Fz(X\, A n ), the 
entropy Hz(z\, ...,z n ) can then be obtained by the Legendre transform formula ( |I.57j ). In that 
formula Zk := z(tk), k = 1, ...,n, with z(t) given for all times t by 

z(t) = f dx Z(x,t)A(x,i)V(xi,t). (1.76) 

It is worth emphasizing that the history z(t) is a continuous function of time. This will be true 
even though the solutions V(x,t),A(x,t) have jump discontinuities at the observation times 
t = tk, k = 1, n. In fact, it is easy to see by a direct differentiation that 

^(t) =< {d t Z + [L*, Z]}.A(t),7>(t) > . (1.77) 
at 

In particular, all of the delta-function sources cancel from this equation. Hence, z(t) is contin- 
uous but will generally have a time-derivative with jump-discontinuities. 

The rest of the estimation protocols outlined in sections 1.4 are the same. For example, the 
double minimization algorithm may be carried out using the Legendre dual pair Hz(z\, z n ), 
Fz(Xi, A n ). The iteration may be initiated by taking 

zf = (Z(tk))t k+ (1.78) 
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and 



\^=K- k l [r k -(Z(t k )) tk _]. (1.79) 

The final result will be an optimal estimated history z*(t) given by ( |L76| ), where V*(t),A*(t) 
therein are the solutions of the forward and backward equations for the X* k , z* k obtained as 
convergents of the minimization algorithm. 

It is worthwhile to compare this procedure for calculating the variational estimator with 
discrete data to that for calculating the optimal KSP estimator in the same circumstances. 
It is shown in Appendix 1 that the optimal estimator may be obtained as well by integrat- 
ing the forward and backward Kolmogorov equations ( |1.16| ),( pl7 ) for V*(t),A*(t) between the 
observation times and by making discrete jumps at those times. The proper jump conditions 



arc 



P*(x, t k +) 



and 



A* (x, t k - 



m(t k - 



i 



■ exp 



exp 



\JZ(x,t k ) - -5Z T {x,t k -)K k 1 5Z{x 1 t k - 



\JZ(x,t k ) - -SZ T (x,t k -)R k 1 SZ{x,t k - 



P„(x,t fc -), (1.80) 



W„(i fc -) 
In these equations 

A fc = Rfc 1 [r fc -(Z(i fc )) ti _], 
the same as the zeroeth-order ( |[.79| ) above, 

6Z(x,t k -) := Z(x,t k ) - {Z(t k )) th -, 



A(x,t fc +). (1.81) 



(1.82) 



(1.83) 



and 



W,(t fc -):= J dx exp \£Z(x, t k ) - ^Z T (x, t k -)H k HZ{x, t k 



V(x,t k -) (1.84) 



is the factor to keep the density "P*(x, t) normalized. The solutions of these equations after a 
single forward and backward integration then yield the conditional probability density, via the 
formula V(x,t\TZ(tf)) = «4*(x, i)"P*(x, t). See Appendix 1. 
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It is now clear that the zeroeth-order variational estimator, calculated after one forward- 
backward sweep initialized with ( p. 78 ), (1. 79), does not coincide with the optimal KSP estimator, 



for the case of discrete-time measurements. The main difference, one can easily see, lies in the 
extra term in the exponent quadratic in d~Z(x, t^ — ). This term has the effect of preventing the 
estimate after the jump from being too far from the prior estimate (Z(ifc))t fc _. The absence 
of this term in the zeroeth-order variational equations helps to make clear in what sense it is 
a "mean-field" approximation of the optimal estimator, obtained by neglect of "fluctuations". 
In fact, if we estimate not the variable Z(i) itself, but rather its sample mean Zjv(i), then 
heuristically A remains unchanged but 5Z^(t) = 0(N~ l l 2 ). Hence, the quadratic term in 
the exponent is 0(1/N) and may be neglected. Of course, it must be realized that we are 
here comparing the optimal KSP estimator with only a zeroeth-order approximation to the 
variational estimator, not to the variational estimator itself for the converged values of A|., k = 
1, ...,n. We hope that, in general, the value of the variational estimator, calculated with the 



"mean-field" jump rules ( |I.61| ),( L62 ) for the minimizing values A£, will not be too far from the 



optimal KSP estimator given by the jump rules (lL80D,(lL8ll) 



The difference between the zeroeth-order variational estimator and the optimal KSP esti- 
mator which we have illustrated above for the case of discrete-time measurements, of course 
also holds in the case of continuous-time measurements. In that case it is a consequences of 
the difference in mathematical interpretation of the zeroeth-order variational equations and the 
KSP stochastic equations, despite their formal identity. Of course, at this point one might 
question the utility of the variational formulation compared with the straightforward Bayesian 
approach based upon the KSP equations. Whether in the discrete- or continuous-time formu- 
lations, it is essentially just as difficult to solve the KSP equations as to make one "sweep" in 
the iterative solution of the variational problem. However, the latter requires in most cases a 
large number of "sweeps" and furthermore provides a suboptimal estimate compared with the 
KSP approach! The advantage of the variational approach will become apparent in Part II, 
when we consider making finite-dimensional approximations. 
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1.6. Calculation of the Ensemble Dispersion 

As discussed in section 1.1, one would like to have not just the mean state x*(i) but also the 
covariance matrix C*(i) in the conditioned ensemble at each time t. As we shall show now, 
the covariance may be readily calculated from the cost function itself. For simplicity, we shall 
confine our discussion to the calculation of the covariance of the measured variable Z(t). The 
changes required for the determination of the state covariance will be obvious. 

Let us discuss first the case with discrete-time data. The entropy function H*(zi, ...,z n ) := 
Hz,r(zi, ■■; z n ; r i 5 •••) r n) that we introduced in ([I.59D of the last subsection has also an in- 
terpretation as a generating function for (irreducible) multitime correlations in the ensemble 
conditioned on rjv(f 3 -) = r,-, j = 1, ...,n. Thus, one may calculate the 2-time irreducible corre- 
lator 

(1.85) 



This irreducible correlator is related to the multitime covariance matrix by matrix inversion: 

C*(t k ,tj) = [T^tj)}- 1 . (1.86) 

The same quantity could also be obtained from ^(A^, A* ) : — Fz,r{^\: A* ; Fi, r n ), the 
Legendre dual of H*(zi, z n ), as 



(1.87) 

A*=0 



From this one may obtain the single-time covariance at any of the times tk by considering the 
diagonal C*(ifc) = C*(t)-,tk). Without loss of generality, one may include any time of interest 
as one of the "measurement times" by simply taking the corresponding value of its observation 
error as infinite, or R/7 1 = 0. 

While this procedure gives the correct result, it is not so practical because the quantity 
of interest, the diagonal C*(tfc), is obtained only through the intermediary of the full 2-time 
covariance C*(£fc, tj). A more useful approach is based upon the single-time generating function 
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obtained from the Contraction Principle 

H*(z;tk) ■■= min iJ*(zi,..,z n ). (1.88) 

z:z fe =z 

This is just the (conditional) relative entropy at time t k . We have chosen here to make the 
time-dependence explicit in the instantaneous entropy. One can calculate the Hessian of this 
function 

d 2 H 

r *(^) = 7nr( z ^) • ( L89 ) 

ozoz 

and then obtain 

C*(t fc ) = [T4t k )}-\ (1.90) 



To employ this method, one must carry out the minimization in ( I.88j ). This leads as before to 
the condition 

^ 1 (z 1 , ...,z n ) := A*(zi, ...,z n ) 

= Aj(z 1 ,...,z n )+Rj 1 (z J --r J -) 

= 0, (1.91) 

for j ^ k with z k = z fixed. This minimization problem can be solved computationally with 
the same methods used to find the global minimum, e.g. the double CG-type algorithm, but 
now with Zfc = z held invariant and minimized only over the remaining variables Zj, j ^ k. 
The result will be the constrained minimizers z*(z;i^) that, substituted into A*(zi, ...,z n ) with 
z& = z, give for all j ^ k. However, 

K{z;t k ) := Afc(zi,...,z n )|- fe=z . - j=z » (z;tfe)jj¥fc (1.92) 

will not be zero. In fact, it is not hard to see that 

dH 

A*(z;i fe ) = — -(z;t fc ). (1.93) 
oz 



Then, from (L8 



r*(4) = — (z;t fe ) 
az 
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(1.94) 



z=z fc 



This gives rise to a simple, practical algorithm to calculate the covariance, by means of a 
finite-difference approximation for some small 5 
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\ a (z+P;t k )- \ a (z-P;t k ) j 
= + i K k \ 

with 



2S (1-95) 



z ±/3 :=z k ±S- ep, (1.96) 

and a unit vector in the /3-direction. We have set A(z; t k ) = X k {z\, z ra )| Sfc=z . Zj=z *( z . ifc ) j^ k - 
This approximation requires the calculation of A*(z; t k ) for the two new values z = z^ displaced 
slightly from z k . This can be accomplished using ( [I.92D and the double minimization algorithm. 
Suitable guesses to initiate the minimization would be 

rm \ z±l3 j = k 
if 1 = _ (1.97) 

and 

a; ui: h, ; . v , i (i.98) 

Of course, z k = z ±/3 is held fixed in the iteration. This procedure must be followed to calculate 
the covariance at each time t k of interest. For each scalar variable, calculating its variance by 
this method is roughly twice the work as calculating the optimal estimate itself over the whole 
interval of time. However, this statement is misleadingly pessimistic. In fact, the initial points 
considered, z k = z ±/3 , Zj = Zj, j ^ k are very close to the optimal history, which is assumed 
known. Hence, only small changes will occur in the %'s, 0{5) corrections to the Zj's, and the 
minimization algorithm should converge quite quickly. The contribution of the various small 
changes can be read off from (|L95). The direct contribution from the change in z k to z ±/3 is 



[C(tfc)]- 1 + Rfc\ where C(t k ) is the covariance in the unconditioned ensemble. The additional 
contributions from the small changes in the z,-, j ^ k will be similar, but will decay according 
to the distance of tj from t k in time. The rate of decay will be determined by some internal 
relaxation or memory time of the system. 
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If the number of variables whose variance is required is large, then even the matrix inversion 
in (1.90) is difficult and should be avoided. This can be accomplished by following an alternative 
procedure, based upon implementing the constraint z k = z by a Lagrange multiplier. In this 
case, (I. 



is replaced by an unconstrained minimization 

H*(z;t k ) := min H(zi, ..,z„; A), 

Zl,...,Z„ 

where 

H{zi, ..,z n ; A) := F*(zi, ..,z n ) + A T (z - z k ), 

and the Lagrange multiplier A is chosen subsequently to impose the constraint z k 
condition for the minimum over all the variables zj, j = 1, ...n, is 



(1.99) 

(1.100) 
z. The 



dH „ 

^-( Zl ,...,Z n ) 



Thus, we see that the minimizing Zj(A;ifc)'s in ( [I.99D are nothing more than 



A*(zi, ...,z„) - X5jk 
0. 



Zj(X;t k ) = z*(Ai,...,A*) 



A^A; X*=0,j^k 



where 



dF* 



Then, using ( I.87| ), one obtains 



(1.101) 



(1.102) 



(1.103) 



9z 



A*=0, 1=1,... ,n 



§(A; tfc ) 
oX 



A=0 



(1.104) 



Incidentally, it is clear also from ( |I.102| ) that the value of the Lagrange multiplier to achieve 
the constraint z k = z is just A = A*(z;tfc) as given in ( L92| ), (|I.93j ). The important point for 
our considerations here is that formula ( |I.104| ) involves no matrix inversion. 

Thus, formula ( |1.104 ) becomes the basis of an alternative procedure to numerically compute 
the covariance. In this procedure, one minimizes H(z\, ..,z n ; A ± ^ 3 ) for the values 

X ±l3 = ±5-ep (1.105) 
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to obtain Zj(X^) t/-), j = 1, ...,n. Then one may approximate 

G a/3W ~ ^ • (1.10b) 

The minimization to obtain the Zj{\ @\ £&) may be carried out with similar methods as before, 
e.g. the double CG-type algorithm initiated with the guesses 

zf = j = l,...,n (1.107) 

and 

Aj°' 0) = HJ 1 [Tj - z 5 \ + X^S jk , j = 1, n. (1.108) 

Our discussion above carries over straightforwardly to the case of continuous-time data acqui- 
sition. The entropy at any time to is, by the Contraction Principle, given as 

H Z:R (z,r;t ) = min T ZjR [z,r] (1.109) 

z:z(t )=z 



with Tz,r as in ( |L27 ). Alternatively, one has 

H ZR (z,r;t ) = minf Zi? [z,r; A], (1.110) 

Z 

with 

f z , R [z,r;\] :=r ZiR [z,r] + A T [z-z(t )]. (LIU) 

Either of the approaches outlined above may be used to find C(to; r )- For example, in the 
second method 

fir. 

(1.112) 



C(t ;r) = f^ ;r,A] 
oA 



A=0 

where z[t; r, A] is the solution of the minimization condition 



= TT#[ Z > r 5 ^ = h fc z ' r l " ~ *«)• ( L113 ) 
dz{t) 



This is solved with the double CG-type algorithm, using jump conditions ( |I.61 ),(I,62) at to 
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II Moment- Approximation of the Optimal Estimator 

II. 1. The Rayleigh-Ritz Method 

Until now all of our theoretical work has been exact and without any approximation, other than 
that involved in conditioning on sample averages, the "mean-field" approximation discussed in 
the section 1.3. However, it is clear that additional approximations are required to achieve a 
computationally tractable estimation scheme for spatially-extended or distributed systems. As 
discussed in the Introduction, the exact calculation of the optimal nonlinear estimator by the 
KSP equations is already known to be numerically unfeasible in such situations. Furthermore, 
computation of the exact variational estimator is just as impractical as the computation of the 
exact KSP optimal estimator for a system with a large number of degrees-of-freedom. It will 
not be possible for almost any system of real, practical interest. The existence of a variational 
principle does not ameliorate the basic computational difficulty imposed by the enormously 
many variables. Just as for the KS filter, moment-closure appears to be the only tractable 
numerical approach to an approximate solution. The advantage of the variational formulation is 
that it permits finite-dimensional approximations to be constructed by a Rayleigh-Ritz method 
which preserves the main structural properties of the exact estimator, discussed previously. We 



shall briefly discuss these features here, referring to previous works [11, 13 1 for many details. 

The Rayleigh-Ritz approximation to the cost function is obtained by means of the charac- 
terization of that functional through the constrained variation in ( [I.19D . Rather than varying 
over all A G L°°,V £ L , one varies only over finitely parametrized trial functions. The trial 
functions are constructed from the usual elements of a moment-closure: a set of moment func- 
tions Mj(x, t), i = 1,...,R and a PDF Ansatz t;fi), which is conveniently parametrized 
by the mean values which it attributes to the moment-functions, n ■= f dx V(x,t; /i)M(x, t). 

The left trial function may be taken to be A(t) = 1 + [B(t) — (B(t))t] with 

R 

B(x,t;a) :=J2 a i M i(^ t )- ( IL1 ) 

i=l 



Following the discussion in section 2.3, we have chosen the left trial state in the form ( p. 22 ) 
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to incorporate automatically the overlap constraint ( p.20|) . The histories a(t),fi(t) are the 
parameters to be varied over. Substituting the trial forms, one obtains the reduced action 

T[a,fj,}= alt a T (t)[ii(t) -V(/x(t),t)] (II.2) 

with 

V(j*,t):=((dt+L*)M(t)) m . (IL3) 
Of course, (')u(t) denotes average with respect to the PDF Ansatz. An unconstrained variation 



of (II. 2) recovers the standard moment-closure equation: fi = VQtt, t). For the calculation of 
the action, however, there is the additional expectation constraint ( |I.21[ ). In terms of the trial 
functions, it becomes 

z(t) = C(Kt),t) + C z (l*(t),t)a(t). (II.4) 

Here, 

COM) := <Z(i)>/x (II.5) 
is the Z-expectation within the PDF Ansatz and 

C z (n, t) := (Z(t)M T (t))f, - C(/i, t)/x T (H.6) 

is the corresponding ZM-covariance matrix. It is remarkable that C(M) t), Cz(/i, t) are the only 
inputs of the PDF Ansatz actually required for the calculation. 

When the constraint ( p.4| ) is incorporated into the action functional ( pi.2[ ) by means of a 
Lagrange multiplier h(t), the resulting Euler-Lagrange equations are 

ii = V(M,t) + Cj(Ai,t)h(i) 

:= V z (j*,h,t). (11.7) 

and 

« + (^) T (/x, h, t)a + (|^) T (/., t)h(t) = 0. (II.8) 

These are solved subject to an initial condition fj-{ti) = /Lt and a final condition ct(tf) = 0. 
When the solutions of the integrations are substituted into ( |II.2| ) , there results a Rayleigh-Ritz 
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approximation Tz[z] to the effective action of Z(t). The value z{t) of the argument is that given 
by the constraint equation ( |II.4| ) for the given value of the control field h(t). A corresponding 
approximation of the cumulant generating functional is given by 

w z [h] = r dt h T (t)c(/i(t), t) (H.9) 



in which fi(t) is the solution of just the forward equation (II. 7) for the control history h(t) 



It is a very attractive feature of the above approximation scheme that the resulting func- 
tionals I~z[z], W^[h] remain formal Legendre transforms of each other. That is, 

W z [h] +f z [z] =< h,z > (11.10) 

and 

*^si H ' (IL11) 

This fact makes it possible to carry over directly all of the minimax algorithms discussed in 
section 1.4 for determination of optimal histories using the exact cost function to the Rayleigh- 



Ritz approximate one. Incidentally, the form of the constraint ( II. 4 ) makes it more apparent 
that this approach generalizes the "sweep method" employed in the case of linear dynamics § . 

The iterative constructions that were discussed for the exact optimal estimator can be 
followed also to calculate the moment-closure approximation. For example, consider the two- 
step method. As the first step, one can calculate the approximate optimal z*[r] given r, by 
minimizing 

f^[z,r]=f^[z] + i f tf dt [r(t) - z(t)] T -R~\t)[r(t) - z(t)}. (11.12) 

over z with r fixed. This can be accomplished, for example, with a double CG method as 
before, taking now 

z(°)(i) = C(AM) (11.13) 
h(°'°)(t) = R- 1 (t)[r(t)-C(M,0] (H.14) 
as the zeroth-order inputs. It is clear that ( [II. 14 ), substituted into the approximate forward 



equation ( II.7| ), is formally equivalent to a moment-closure of the KS-equation. (Although it 
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must be emphasized once more that, in the case of the KS filter, the closure equation analogous 
to ( [II. 7 ) must be regarded as a stochastic differential equation.) The second step is to calculate 



the approximate optimal x*[z] by minimizing rx,z[x, z] over x with z fixed. Of course, the 
Rayleigh-Ritz approximation fx,z[x,z] is calculated by the analogous equations as ( |II.7j ),( |l"L"8| ): 

ii = V(/x,t) + Cj(/x,t)k(t) + Cj(/i,t)h(t) 

:= Vx,z(/i,k,h,t). (11.15) 

and 

a + (^f) T (/*, k, h, t)oc + (||) T (/*, t)Ht) + (|^) T (/i. i)h(t) = 0, (11.16) 

where £(/i, i), Cx(fJ>,t) are the closure X-mean and XM-covariance, respectively. The final 
approximate estimator is then the composition x*[r] = xjz^r]]. 

However, there is a potential difficulty in applying the minimization algorithms: the Rayleigh- 
Ritz approximations to the cost functions need not be convex at all! Lack of convexity would 
correspond to a failure of realizability of the predicted multi-time correlations |ll]] . As a conse- 
quence of this failure, there might exist local minima in addition to the global one or, possibly, 
no minimum at all, local or global. In the former case, a CG algorithm could be trapped in a 
local minimum, and, in the latter, it would not converge at all. Thus, for numerical purposes, 



it is exceedingly desirable to maintain convexity. It was shown in [12] that convexity will be 
maintained — at least for an expansion of the action to quadratic order in small departures from 
the minimum — whenever the relative entropy is a Lyapunov stability function for the closure 
dynamics. It is possible to construct closures for nonlinear stochastic dynamics which guaran- 



tee the validity of such an -theorem [27], using methods previously developed for Boltzmann 
kinetic equations in transport theory p8| . One example of the general scheme are closures 
based upon an exponential PDF Ansatz. Such closures have the property that the relative 
entropy satisfies an ^/-theorem and thus (local) convexity of the Rayleigh-Ritz approximations 
is guaranteed. This is discussed further in |27j and in section II. 3 below. 
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II. 2. Discrete-Time Data and Ensemble Dispersion 

We have seen in section 1.5. that the estimation problem based upon discrete-time data has, 
in the exact formulation, a simple solution in terms of certain jump conditions. The situation 
is worse for closure approximations. In fact, we shall see below that, for general moment- 
closures, the approximate "smoother" with discrete-time data may not even be continuous at 
the observation times! This is an important failing since these same methods are also involved 
in the calculation of instantaneous ensemble dispersions, as we have seen in section 1.6. 

Let us illustrate the nature of the problem for a general moment-closure. If one differentiates 
the expression for z(t) in ( |II.4j ) with respect to time, using the variational equations ( |II.7D , ( [II. 8| ), 
simple computations give a result of the form 

dVi 



+h b (t) 
+h b (t) 



dC z 



dm dm 

dC z dC z 



a : 



(11.17) 



For any function of fj,, t we set ^ := Jj+V(/x, i)-V^t. Equation ( |II.17 ) should be compared with 
the exact result in ( |1.77 ). In contrast to the cancellation of the explicit h(t) terms found there, 
such terms remain in the second and third lines above. Only in the case where a single scalar 
variable z(t) is considered and thus a = b = 1 is there an obvious cancellation in the last two 
terms. This means that, in general, the delta- functions will not cancel if one considers a control 
field of the form h(t) = J2k^k$(t — tk), as appropriate for discrete-time data, and z(t) itself 
will have jump-discontinuities at the measurement times tk- Of course, one take observations, 
not instantaneously, but instead averaged over a small interval of time r. The delta functions 
are then replaced by approximate delta's 5 T {t — tk) with time- window r. However, the problem 
will reappear when r is taken very small, for then z(t) will change sharply at times tk- 

A related problem has to do with the formulation of proper jump conditions in the same 
circumstances. Let us even assume that z(t) is a single scalar. Then, the forward equation for 
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the moment variable fi becomes 

in = Vi(n, t) + h(t)cf(ti, t). (11.18) 

If h{t) is a sum of delta- functions, then one cannot integrate the equation to obtain the jumps 
in fii at the observation times. The difficulty is that Cf((i,t) will then also have jump- 
discontinuities at those times and it is impermissable to integrate a delta-function against 
a discontinuous function. The obvious strategy is first to divide both sides by C z (fx,t) and 
only afterward integrate across the jump. There is still a problem however. The other variables 
besides m in the integrand also make jumps and it is therefore ambiguous which value should 
appear as the integration range shrinks to zero. Thus, the strategy only works in the case where 
there is also a single scalar moment variable fx(t). In that case, we can integrate and obtain a 
jump condition in the form of an "area rule": 

We have set [it = /i(ifc±). The backward jump condition for the adjoint variable a then follows 
most easily from the continuity of z(t) noted above. With notations as above, = a(£fc=t) 
and so forth, we have 

tt + C Z k + ai = ^ + C z k ~a- k . (11.20) 

The jumps in £, C z are known, because these are assumed continuous functions of //, t and the 
jump in \x is known from ( |II.19 ). Solving for the backward jump gives 



a~ k = (Cfc+ Ck) c t CZk+a+k ■ (IL21) 

Hence, only in the case of a single scalar moment function and observation variable is it obvious 
how to formulate jump conditions, in the case of a general moment-closure. 

It still remains in that case to formulate the algorithm to calculate the cost function itself. 
The proper definition turns out to be 

n 

F z (\i, An) := £(AF) A (Ai, A fc ) (11.22) 
k=i 
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where the increment at time tk is given by a second area rule: 

jfTOJ**^ (IL23) 

There is a simple heuristic motivation for both this rule and the previous one. In fact, the basic 
approximation is to replace the exponentially-modified PDF by a PDF from the Ansatz with 
an adjusted moment. That is, 



wn 1 - T r e^(*^P(x, t fc ; ) » P(x, t; M + 



(11.24) 



with the normalization factor 



W(A fc ;^,t fe ) := J dxe A ^( x '«P(x,t fc ;^). 



(11.25) 



The M-moment of ( pT24| ) is 



Mfc (Afc;/i fe ,*fc) := 



1 _ - : / (ixM(x,t t )e A * A >P(x,f fc ; ft - 



M*. , tit) 



(11.26) 



Differentiating once and using again (11.24) thus gives 

BiiJ _ z + 



(11.27) 



The first area rule ( [II.19D is just an integral form of this latter relation ( II.27| ). Likewise, if we 
define 

(AF) fc (A fc ;/ifc ) tfc) :=logW(A fc ;^,t fc ), (H.28) 
then we see by applying ( II.24| ) twice again that 

d(AF) k 



(11.29) 



The second area rule ( 11.23 ) is likewise the integral form of ( II.2S| ). Note from ( II.28Q that all 
of the dependence of (AF)k upon Ai,...,Afc_i is through , analogous to (|L75| ). The cost 



function Hz(z\, ...,z n ) is finally defined as the Legendre dual of Fz(\i, A n ) given by (|L22|). 
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The jump conditions |L19|) ,( FL2lD may be used very much as the exact ones QI.61| ),( |r6^ ) for 



the purposes of estimation with discrete-time data and of ensemble variance calculation. Only 
the Rayleigh-Ritz approximations of the cost functions need be substituted for the exact ones 
in the algorithms described earlier. In calculating the Legendre dual H%{z\^ ...,z n ) the adjoint 
equation is used to evaluate z k = C fe + C z a k . Of course, one should check that this gives the 
same result as direct differentiation z k = . This is true but we shall not give the proof here, 
because we prove a very similar result in the Appendix 2. The proof is based upon the easily 
established relations 

84 _C Z + 5(AF) fc _C fc + -C (n 30) 



6^ C*-> dp' CZ- ■ 

Of course, the adjoint equation need not be employed at all, but it is a convenient way of 
evaluating the required derivative. 

Substantial simplifications in the jump conditions occur in the important special case where 
Z = M. In that case Ci/^^t) = Mi C z (/i,t) = C(p,t). We can then define a function 

*<"•'> -J^cfij < IL31 > 

where n(t) is the solution of the unperturbed moment equation. If fi(X, t) is the inverse function, 
then we can also define 

F(X, t) := [ X n(\,t) dX. (11.32) 



o 



It follows by our definitions that 

F'(X,t) = p, F"{X,t) = C. (11.33) 
In terms of the function X(p,t) the first area rule (|II.19|) becomes 

X(p+,t k )-X(^,t k ) = X k . (11.34) 

Also, using ( II.33; ) we note that 
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/i(A) dX 

= F(\+,t k )-F(\-,t k ). (11.35) 

Hence, the "area rules" are replaced by equations involving discontinuities of explicit functions, 
always assuming, of course, that integrals defining the functions in ( [II. 31 ), (11.32) may be evalu- 



ated. The key to this simplification was the relations in (11.33), which imply that F is a convex 
"potential" generating the first and second moments of the PDF Ansatz. Such a potential will 
always exist for functions of one variable, but not in general for multivariate functions. 
II. 3. Exponential PDF Closures 

We have seen above that, for a general closure, there is a satisfactory treatment of estimation 
with discrete-time data only for the case where there is both a single measured variable Z(t) 
and a single closure variable M(t). Obviously, this is an extreme limitation. However, it may 
be possible to circumvent this severe restriction within special classes of closures. In fact, as we 
show now, closures constructed with an exponential PDF Ansatz have better properties. We 
shall see that they guarantee continuity of optimal estimators. Furthermore, they provide very 
simple "jump-conditions" for estimation with discrete-time data. 



Exponential PDF closures are one example of the general class considered in [27]. Hence, 
we shall only make a quick summary of the properties required here and refer the interested 
reader to the paper [p?]] for more details. Most concretely, the class of closures we consider are 
those built from a PDF Ansatz of the exponential form: 

exp(A T M(x,t)) ^ 
V(x, t; A) = ^ P*(x, t) (11.36) 



with 



M{X,t) := J dx exp(A T M(x,t)):P*(x,£). (11.37) 



Here V*(x,t) is a reference PDF. To guarantee some of the good properties of the closure 
discussed in [^7j , the reference PDF must be a solution (or approximate solution) of the Fokker- 
Planck equation. However, for the properties discussed here, V*(t) may be an arbitrary PDF. 
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The exponential family in (11.36) is parameterized by the "potential" variables A, rather than 



by the moments /x of the closure variables M(x, t). However, there are simple relationships 
between these quantities. We may define 

F(\,t) := log AT (\,t), (11.38) 

which is a cumulant-generating function for the variables M(t) in the PDF Ansatz. Likewise, 
its Legendre transform 

H(fA, t) := max |/i T A — F(X, i) j (11.39) 

is a generating function for irreducible correlation functions of M(i). It is the relative entropy 
for the PDF Ansatz in ( |II.36 ) with respect to the reference PDF P*(t). Under some conditions 



discussed in 27], it satisfies an H-theorem for the closure dynamics constructed with the Ansatz. 
However, the role of F, H as generating functions will be more important for us here. Thus, 
fj, = ^ and conversely A = Ijjj. It is a consequence of the former that is the covariance 
matrix C of M and that |y is the 3rd-order cumulant. These relationships will prove to be 
important in the following. 

We shall now show that, for the exponential PDF closures, the history z(t) is continuous 
even for h(i) consisting of delta-function spikes, when the variables Z(t) are among the closure 
variables M(t) themselves. This last condition places some restriction, but a fairly modest and 
natural one. Without any loss of generality, we can consider Z{t) to consist of the entire set 
of closure variables M(t). As before, some of our previous formulas then simplify considerably. 
For example, C(m> t) = A 4 an d C z (fj,,t) = C(/x, t), the usual MM-covariance matrix. Then 
( II. 4| ) is replaced by 

m(t) = £t(i) + C(fj,(t),t)a(t). (11.40) 



The time-derivative of the latter, given in general in ( [11.17 ), also simplifies. In fact, the term 
in the bracket in the second line becomes 



C ba (n,t) - Cabin, t) = 
45 



(11.41) 



which vanishes by the symmetry of the covariance matrix. The term in the bracket in the third 
line of ( 11-17] ) becomes 

C ajb ( t ji,t)-C bja ( t x,t)=0 (11.42) 



where C a f, c (/i, t) is the 3rd-order cumulant of M(i). Indeed, 

CajkTki. (11.43) 



9Cgj _ dC a j d\k 



Since the irreducible 2nd correlator is the inverse covariance matrix, T = C _1 , the expression in 
( II,42j ) follows from the corresponding expression in ( [II.17 ). However, it is obvious that ( [11.42 ) 



vanishes, by the symmetry of the 3rd-order cumulant. Putting together all of these results, we 
have 

Am r d / f)n.\ / f) V \ 

ex. (11.44) 



dm . . __. . 



d /dfj,\ fdV\ 
Jt VdX) ~ \~d\J 



This should be compared with the exact expression ( |1.77 ). Just as there, we see that the terms 



directly involving h(t) all cancel. Hence, m(t) remains continuous even with h(i) containing 
delta-function spikes. 

We shall finally show that the exponential PDF closures also permit the formulation of 
simple jump conditions at the times tf. where the delta functions occur. This should not be too 
surprising, when one considers that the exact jump conditions in ( ]I.61| ),( p62 ) consist simply of 



suitable exponential modifications of the solutions of the foward, backward equations. To derive 
the jump conditions in the closure, we use a strategy motivated by that in section II. 2. In fact, 
observe by C = |^ and the chain rule that /!t = C(A) A. Thus, if one defines W(A) := r(/i)V(/i) 
and 7 := C(A)a, then in terms of the new variables 7, A, the nonequilibrium action, including 
the constraint term with the Lagrange multiplier, becomes 

r[ 7 ,A]= f f dt { 7 T [A-W(A)]+h T (0[m(t)-/x(A)- 7 ]}. (11.45) 

The Euler-Lagrange equations in terms of these variables become 

A = W(A,t) + h(t), (11.46) 
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'dW\ T 

7+l^H (A,i) 7 + C(A,i)h(i) = 0, 



dX 



(11.47) 



and the constraint equation 



m(t)=j*(A,t)+7. (11.48) 
In the first equation ( |II.46 ) we may integrate across the spike X k 5(t — t k ) in h(t) to obtain 



A(/u£, t k ) - \(fj, k , t k ) = X k . 



(11.49) 



These are the forward jump conditions. As should not be unexpected, the potential A(/x, t) is 
simply incremented by X k at the spike. A similar result can be obtained by integrating the 



backward closure equation (11.47) across the spike. However, it is simpler to use the continuity 



of m(t) at the jump, which was established above. Then from (II.48|) one immediately derives 



7 fc = (/4-/^)+7fc- 



(11.50) 



These are the backward jump conditions. 

The multi-time cumulant-generating function F(Xi, A n ) can be obtained from ( |1.73 ) with 



the observation that N(t f ) = Y\t =l k > and thus (Ol) holds with 

N{X k ,t k ) 

(AF) fc (Ai,...,A fe ) = F(X+,t k )-F(X^,t k ) 



(11.51) 



generalizing ( [II.35 ). Then the multi-time entropy H (mi, m„) is obtained by the Legendre 
transform 

n 

%i,..,m n ) = ^2mJX k -F{X l ,...,X k ). (11.52) 

fc=i 

with rrifc given by ( II.40| ), m^ = m(t k ), for t = t k , k = l,...,n. Of course, it must be shown 
that 



m/, 



<9F 
<9A fc 



(11.53) 



for all k = 1, ...,n in order for (11.52) to be valid. Cf. equation ( I.57| ). The proof is somewhat 
technical, so it is given in the Appendix 2. 
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While the previous approximation has a rather elegant and tractable formulation, there 
is nevertheless also an unpleasant asymmetry between forward and backward time directions. 
Thus, information propagates forward in time via the nonlinear closure equation ( [II.46 ), but 



information propagates backward in time via the equation (11.47) which is linear in the adjoint 



variable 7. Ultimately, this asymmetry is due to our employment of a nonlinear (exponen- 



tial) Ansatz (II.36|) for the PDF, while the solution of the backward equation is taken to be 



of the linear form (II. 1). However, there is nothing in the Rayleigh-Ritz method which re- 
quires the use of the linear Ansatz ( |II.1| ) for the left trial state. In fact, that expression has 
other unpleasant features. The exact backward Euler-Lagrange equation flL24|) is known to be 
positivity-preserving, so that the solution *4(x, i) starting from final data A(t) = 1 must be 
everywhere nonnegative. However, the linear Ansatz «4(x, t) = 1 + Y^i=i oti(t)[Mi(x., t) — Hi{t)\ 
may easily become negative, if the adjoint variables ex become large enough in magnitude. It is 
therefore desirable to consider more general Ansdtze for the left trial state than the linear one. 

Within the context of exponential PDF closures a particularly symmetric and attractive 
choice is to make the double exponential Ansatz: 

V{x, t) = exp [/3 T M(x, t)-F(f3, t)] P* (x, t) (11.54) 

for the right trial state and 

A(x, t) = exp [q t M(x, t) - (A a F) ((3, t)l (11.55) 

for the left trial state. Here (AaF)((3, t) := F{a. + f3,t) — F(j3,t) so that the normalization 
constraint < A(t),V(t) >= 1 is automatically satisfied. It is then clear that, for small a, ( [11.55 ) 



coincides with the linear Ansatz. (Note that (AcxF)(/3, t) = ot T fi(/3,t) + 0(a 2 ).) However, this 
new Ansatz is globally nonnegative and symmetric in form to the exponential for the right 
trial state. An even more attractive feature of this double exponential Ansatz is that, within 
it, the Rayleigh-Ritz effective action of the closure variables M themselves may be calculated 
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analytically in closed form. The result is: 

r[m] = ~ ! 1 dt [m(t) - V(m, t)] T Q _1 (m, t)[m(t) - V(m,t)], (11.56) 
4 7t 4 

where 

Qyfot) := ((V x M i ) T D(V x M J )) A(mjt) . (11.57) 

This effective action has precisely the Onsager-Machlup form. The statement generalizes a 
previous result in for general closures, that the Rayleigh-Ritz effective action has the 
Onsager-Machlup form to quadratic order. Let us just briefly sketch the derivation, which will 
be given in detail elsewhere [29|, along with a complete discussion of its remarkable properties. 



It is a straightforward calculation to show that 

(d t +L*)A(x, t) = {c* T M(x, t) + t* T M(x, t) + V x (q t M)-D- V x (c* t M) - A a F(J3, t)\ A(x, t). 

(11.58) 

In that case 

< {d t + L*)A(t),V(t) >= a T /i(A, t) + ct T V(A, t) + a T Q(A, t)a - A a F(J3, t), (11.59) 

where A := a + (3. However, it is easy to see that the second constraint on the mean values 
becomes in these variables m(i) =< A(t), M.(t)V(t) >= /x(A, t). Thus, holding the history m(i) 
fixed is equivalent to holding X(t) fixed. We cannot vary independently over ot(t) and (3(t), 
but one is determined from the other via the relation X(t) = ct(t) + (3{t). Since, for fixed m(t), 
( II.59| ) implies that 

T[a,/3] = l S dt {a T [m- V(m,t)] - a T Q(m, , (11.60) 



maximizing over a yields (|L56) 



Although the Onsager-Machlup form (|II.56| ) is most interesting for theory, practical estima- 
tion is easier with the latter expression ( |II.60| ). Including the cost function for the observations, 
the total action to be minimized is 

1 A 

dt <! a 1 [rh — V (m. t)\ — a C4(m, t)a S- + 

Hi ~ fc=i 



r*[a,m] = J dt {a T [m-V(m,t)]-a T Q(m,t)Q} + -^[m(t fc )-r fe ] T R^ 1 [m(t fc )-r fc ]. 

(11.61) 
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The Euler-Lagrange equations of this problem are 

m = V(m,t) + 2Q(m,t)a, (11.62) 

dm") a+ d^i( aTQa ) =J2' R k 1 l m (tk)-r k ]5(t-t k ). (11.63) 
Solving these equations with boundary values m(ij) = mo and ct{tf) = can give directly 
the optimal history, without the need of applying any explicit minimization algorithm. It is 
transparent in this formulation that the optimal history m* (t) is continuous at the observation 
times, because the first equation ( II.52| ) contains no delta-functions in time. Only the adjoint 



variables a(t) suffer jumps at the measurement times t = t k . 

The same circle of ideas may be applied to constructing closures of the KSP equations for 
the optimal history itself, rather than just the variational approximation. In fact, assume that 
the closure variables M consist of the measured variables Z and their tensor products Z <g> Z, 
M := (Z, Z (g) Z), with mean values given by m = (£, S) for a double exponential Ansatz. 
Let the exponential parameters in the left trial state then be denoted as (a, A) and those in 
the right trial state as (/3,B). Because the states evolve by the (unperturbed) forward and 
backward Kolmogorov equations between measurements, the Euler-Lagrange equations within 
the closure are of the same form as those in ( |II.62j ),( |il.63 ). As there, there are no jumps at 



measurement times in the equations for m = (£, X). On the other hand, there are simple jump 
conditions for the adjoint variables (a, A), which may be read off directly from (|A.3|),([A~7^): 



a k = a k (H.64) 
A fc =A+-iR fe -\ (11.65) 

for k = 1, ...,n. Further details, including the formulation for continuous-time observation, will 
be given elsewhere. We only note here that there is a price to be paid for constructing a closure 
of the KSP equation: the necessity of including among the closure variables the squares of the 
observed variables in addition to those variables themselves. 
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Ill Conclusion 



This paper is intended to serve as a primer and technical reference for the application of the 
proposed variational estimation scheme to concrete problems. We have discussed the meaning of 
the variational estimator within ensemble theory and emphasized its character as a "mean-field 
approximation" to the optimal estimator. Neither the variational method nor the optimal KSP 
method can be directly applied in practice to complex, high-dimensional systems. An action 
functional can be used to construct Rayleigh-Ritz or moment-closure approximations of both 
the variational and KSP estimators, but the variational scheme has the advantage of requiring 
simpler, lower order closures. We have discussed a number of special closure schemes, based in 
particular upon exponential Ansdtze, that preserve good properties of the exact estimators. We 
have discussed also the numerical implementation of the variational estimation scheme, both 
exactly and within a Rayleigh-Ritz approximation, both to obtain the estimator itself and also 
to approximate the variance or ensemble dispersion. Most of the algorithms discussed here have 



already been implemented in ][3(| and in our forthcoming work [31]. 

In addition to providing a practical estimation scheme, we hope that the variational frame- 
work will provide also some additional physical insight into the complex stochastic systems to 
which it is applied. It exploits a thermodynamic formalism for far from equilibrium systems 
and provides a motivation to understand better the concepts of action and entropy in concrete 
physical systems, e.g. atmospheres, oceans, ecosystems, living organisms, etc. 
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A Appendices 



Appendix 1: Optimal Estimation with Discrete-Time Data 

We give here a simple derivation of the Kushner-Stratonovich-Pardoux equations for estimation 
with data taken at a discrete set of times tp., k = 1, ...,n. The problem set-up is the same as 
described in Section 1.5. We define V*(x,t) := "P(x, t\ri, ...,rfc) for t k+ % > t > t k , so that V*(t) 
is right-continuous in time. It is then clear that between measurement times, V*(t) evolves by 
the forward Kolmogorov equation ( |1.16 ). At measurement times, 



V*(x,t k +) = V*(x,t k - \Z(t k ) + p k = T k ) 



(A.l) 



for k = 1, ...,n. Thus, by Bayes' rule, 

V*(Z(t k ) + P k = r fc |x,t fc -)7 : '*(x,i fe -) 



P*(x,t fc +) 



(A.2) 



Jdy V*(Z(t k ) + p k = r k \y,t k -)V*(y,t k -)' 

By our assumptions, p k is a normal random variable of mean and covariance R^, independent 
of the process X(i). Hence, if Z, p are s-dimensional 



V*(Z(t k ) + p k = r k \x,t k -) 



v /(2vr) s Det R, 



: exp 



1 



(Z(x, t k ) - r k ) T R fe 1 (Z(x, - r fc ) 



(A.3) 



The term 



^/(27r) s Dct R fc 



cxp 



-ir7R, 1 r k may be cancelled between numerator and denominator 



in (|A.2|). Hence we obtain finally the forward "jump condition" 



cxp 



7 7 *(x,^+) 



with the normalization factor 



W(ri,...,r fc ) := J dy exp 



r TR fe 1 2( x ,t fc ) - iZ T (x,t fc )R^ 1 S(x,t fc ; 



W( ri ,...,r fc ) 



rjR^Z(y,t k ) - -Z T (y,t k )K^Z(y,t k ) 



■P*(x,t fc -) (A.4) 



V*{y,t k -). (A.5) 



Next, we define for t k _i < t < t k , 



A(x,i) := 



P(x,i|ri, ...,r n ) 7 ? (x,t|ri,...,r n ) 



P*(x,t) 



P(x,i|ri, ...,r fc _i) 



(A.6) 



52 



and for t > t„. 



Writing this definition as 



A(x,t) := 1 



P(x,i|ri,...,r n ) P(x,t|ri,...,r fc ) 



P(x,t|ri, ...,r k ) P(x,t|ri, ...,r fe _i) 
and using the already derived condition (A. 4), we obtain for t — > t k — that 



(A.7) 



(Ai 



A*(x,tk—) = A*(x,t k +)- 



exp 



rjR fc ^(x, t k ) - \Z 1 (x, t fc )R- i Z(x, t k ) 



(A.9) 



W(ri,...,r fe ) 
This is the backward "jump condition". 

It remains only to show that «4.*(x, t) defined via QA.6| ) satisfies the backward Kolmogorov 
equation ( |L17| ) between measurements. We apply again Bayes' rule, in the form 

V{r k , ...,r n |x,i;r 1 ,...,r jt _i)'P(x,i|ri, ...,r fc _i) 



Px,in,...,r n 



However, by the Markov property, 



V(r k , ...,r n |ri, ...,r fc _i) 



(A.10) 



7 ? (r jfc ,...,r n |x,t;ri,...,r fc _i) = V{r k , ...,r n |x,i) 

= J dy k V(r k ,...,r n \y k ,t k )V(y k ,t k \x,t) (A.ll) 



when t fc _i <t<t k . Putting together (|A^),( |A.10| ),( [A.11| ), we conclude that 

V{r k , ...,r n \y k ,t k ) 



A*(x,t) = / dy k 



-V(yk,tk\x,t). 



P(r fc ,...,r n |ri,...,r fc _i)' 
Since the transition probability satisfies the backward equation in the variables x, t 

(d t + L*)V(y k ,t k \ X ,t)=0, 



(A.12) 



(A.13) 



it then immediately follows from the integral representation ( A.12| ) that A*(t) satisfies ( [1, 17 ) 
for t k ^i < t < t k , k = 1, n. 

It is not hard to show that the jump conditions above, ( |A.4[ ),( [ATS| ), are equivalent to those 
given in the text, ( I.80| ),( p8l| ). One simply multiplies the numerators and denominators in 
(A. 4), ( A.9 ) by the factor exp — ^(Z(t k ))J k _~R, k ~ 1 (Z(t k )) tk - and rearranges terms in the expo- 
nents by completing the square. 
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This is an appropriate place to discuss the close formal resemblance of the KSP jump condi- 
tions, (pQD.flnnp, to the jump conditions, ( [L61| ),( |L"62| ), employed in calculating the multitime 



entropy Hz (or, more correctly, its Legendre dual Fz-) In fact, the exponential PDF Ansatz 
( II,36| ) has also an interpretation as a conditional PDF. The conditioning is now upon the event 



that the empirical average Zjy = z in the limit as N —* oo: 



w Mm P.(x,t|5Mt) = z) = g^g|^) p,(x,t) (A.14) 

with A = X(z,t). More precisely, the result is that nmjv-«>o T 5 ® (xi, ..,Xjv,£|Zjv(i) = z) = 
nili ^^ffi^ ^Cxi^). Here the product measure ^(x^-.x^jt) = is 
taken, to correspond to an ensemble of N independently prepared samples. Convergence to 
the new product measure holds for any finite-dimensional marginals (i.e. for i G S, any finite 
set, as N — > oo). Statistical physicists will recognize this as an equivalence of ensembles result, 
in which the "micro canonical ensemble" corresponding to the condition Zjv(i) = z becomes 
equivalent in the thermodynamic limit to the "canonical ensemble" with potential A(z, t). 

As a consequence of this, we may interpret the solution V*(x,t) of the forward equation, 
with the jump conditions (|I.61| ) at measurement times less than t, as 

P*(x,i) = V(x,t\zi, ...,z fe ), tfc+i > t>t k , (A.15) 

where the righthand side is shorthand for the PDF conditioned upon the event Zjv(ii) = zx, 
ZAr(tfc) = Zfc in the limit — > cxo. Likewise, 

A(x,t)P*(x,t) =P(x,t|zi,...,z„), (A.16) 

where the conditioning is now upon Zj\r(tj) = Zj for the full set of times tj, i = l,...,n, both 
those before and after the time t. The proof of this assertion is exactly the same as for the 
corresponding results proved earlier in this appendix regarding PDF's conditioned upon obser- 
vations rj., r n . This relation to conditional PDF's helps to explain the close similarity to the 
KSP formalism. Note, however, that this is the wrong set of conditions to use for estimation, 
as it is upon Zn itself and not upon (empirical means of) observations r^v = Zjv + Pat- 
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Appendix 2: Adjoint Calculation of a Derivative 

For completeness, we shall give a direct proof of ( II.53| ) here. We first observe by (1.74) and 
causality that 



dF _ " g(AF); 



Then we note from ( 11.51 ) that 



while 



d(AF) k + _ 

-^x~r = ^ - ^ 



(A.17) 



(A.18) 



(A.19) 



Thus, by (A.17),( A.1§| ) and the chain rule 

dF 



dX k 



l=k+ 



1 duf d\ k 



(A.20) 



Furthermore, by (A.19) and the chain rule, 

d(AF), d(AF)i dXj~ 



(A.21) 



Therefore, it only remains in ( A.20| ) to evaluate ^ for I > k. The Jacobian matrix for 
arbitrary times t ^ ti, I > k satisfies the linearized equation 



dX k 



(A.22) 



where A(t) := §^(n(t),t). The initial condition 



0/i(t) 



dX k 



(A.23) 



t=t k + 



is provided by the formula [i£ = fi(X^ + A&) whence ^ = CjJ". At the measurement times 
t = ti, I > k there is an additional multiplicative factor, which follows from the Jacobian 



drf _ dpf dX] 



M dX l 0^ 



(A.24) 
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The solution for ti > t > Z > fc, is 



3At 



Texp 



A(s) ds 



ff C/r-.Texp 

3=fe+l 



A(s) (is 



(A.25) 



Here Texp denotes the time-ordered exponential with matrices at increasing times to the left, 
and likewise II is the time-ordered product in the same sense. Thus, the final result 

i-i 



T exp 



A(s) ds 



II C+TT Texp 



tj-i 



A(s) (is 



(A.26) 



follows upon setting t = ij— . 

This may be compared with the solution of the adjoint equation 

d t a(t) + A*(t)a(t) = 



(A.27) 



integrated backward in time for t j^ti and subject to the jump conditions (|Q.50D at t = ti, I 
1, ...,n. The explicit solution for i& < t < tf.+i is 



a(t) = Texp 
l=k+l 



A* (si ds 



n, = , r 7-i C |-i-Tex P 



A*(s) ds 



i-i 



iy[/^-/V]- 

(A.28) 

Now Texp denotes anti-time-ordered exponential with matrices at decreasing times to the left, 
and II is the anti-time-ordered product. Setting t = and regrouping terms gives 



=fc+i 













/ A*(s) ds 


•TTC+jTexp 


/ A*(s) ds 






Jti-i 



Finally, substituting (|A.21| ),( |A2^ ) into (|A.20|) , and using (|A.29|) , gives 

<9F 



A** + («*) c fc = m ^ 



(A.29) 
(A.30) 



which is exactly the result required. 
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